%load_ext pretty_jupyter
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import geopandas as gpd
from datetime import timedelta
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
1. Introduction
This document is the technical report for a semester project in the MLP course, taught at the Faculty of Information Technology, CTU.
1.1. Project description
This project analyzes the provided dataset, which contains information about earthquakes from all around the world in the period from 1990 to 2023. Based on the gathered insights, the dataset is adjusted and a simple machine learning model is built to predict the significance level (low, medium, high) of an earthquake from features such as its magnitude, depth, and location.
1.2. Hypotheses formulation
In the end, the gained insights and the model evaluation should provide enough information to reliably decide whether the following hypotheses are true or false:
- A. Higher magnitude earthquakes are more likely to be classified with higher significance.
- B. Earthquakes with lower depth are more likely to be classified with higher significance.
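Both hypotheses lend themselves to a simple quantitative check, for example a Spearman rank correlation between each feature and significance. The sketch below runs on made-up toy values, not dataset rows; the real test would use the cleaned dataset from the later sections:

```python
import pandas as pd

# Illustrative toy values only -- not taken from the dataset
toy = pd.DataFrame({
    'magnitudo': [1.0, 2.5, 4.0, 5.5, 7.0],
    'depth': [50.0, 40.0, 30.0, 20.0, 10.0],
    'significance': [20, 90, 250, 600, 1200],
})

# Hypothesis A: positive rank correlation between magnitude and significance
corr_a = toy['magnitudo'].corr(toy['significance'], method='spearman')
# Hypothesis B: negative rank correlation between depth and significance
corr_b = toy['depth'].corr(toy['significance'], method='spearman')
print(corr_a, corr_b)  # perfectly monotone toy data, so ~1.0 and ~-1.0
```

A rank correlation is preferred over Pearson here because both hypotheses only claim a monotone relationship, not a linear one.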
1.3. Outline
The report is divided into several sections, each focusing on a different aspect of the project:
- Introduction: This section provides an overview of the project, including the project description, hypotheses formulation, and outline.
- Data analysis: This section focuses on analyzing the provided dataset, including a dataset description, initial exploration, and advanced exploration. The initial exploration examines the dataset's structure and contents, analyzing each data column to understand its significance and characteristics. The advanced exploration delves deeper into the most relevant columns identified during the initial exploration, providing a more detailed analysis of these key aspects.
- Data preprocessing: This section focuses on preparing the dataset for the machine learning model. It includes data cleaning and data transformation.
- Model engineering: This section focuses on creating and evaluating the machine learning model. It includes preparing the training and testing datasets, creating the model, training the model, and evaluating its performance.
- Conclusion: This section provides a summary of the project, including the results of the data analysis and model engineering, as well as the conclusions drawn from the project.
2. Data analysis
This section focuses on analyzing the provided dataset. The analysis is divided into three parts:
- 2.1 Dataset Description: This section presents a summary of the dataset as described by its creator.
- 2.2 Initial Exploration: This section examines the dataset’s structure and contents, analyzing each data column to understand its significance and characteristics.
- 2.3 Advanced Exploration: This section delves deeper into the most relevant columns identified during the initial exploration, providing a more detailed analysis of these key aspects.
2.1. Dataset description
Description provided by the dataset provider:
The earthquakes dataset is an extensive collection of data containing information about all the earthquakes recorded worldwide from 1990 to 2023. The dataset comprises approximately three million rows, with each row representing a specific earthquake event. Each entry in the dataset contains a set of relevant attributes related to the earthquake, such as the date and time of the event, the geographical location (latitude and longitude), the magnitude of the earthquake, the depth of the epicenter, the type of magnitude used for measurement, the affected region, and other pertinent information.
Columns in data files:
- time: unix timestamp in milliseconds
- place: geographical location
- status: Represents the current state or condition of the event, which could be reviewed or automatic
- tsunami: Relates to a series of large ocean waves typically caused by an underwater disturbance, often associated with earthquakes.
- significance: Denotes the importance or impact level of the event, which could be used to assess the potential consequences.
- data_type: Specifies the type of data being referenced
- magnitudo: Refers to the measurement of the size or intensity of an earthquake, typically measured on the Richter or moment magnitude scale.
- state: Represents the administrative division or state where the event occurred, often applicable to specific countries.
- longitude: coordinate float
- latitude: coordinate float
- depth
- date
2.2. Initial exploration
This section focuses on the initial exploration of the dataset.
Summary:
- The dataset contains 3,445,751 rows and 12 columns.
- The columns are: 'time', 'place', 'status', 'tsunami', 'significance', 'data_type', 'magnitudo', 'state', 'longitude', 'latitude', 'depth', 'date'.
- There are 33,734 (0.98%) duplicated rows.
- Columns:
- 'time':
- Contains Unix timestamps in milliseconds.
- The longest time without an event observation is 62.01 days.
- Redundant with the 'date' column → can be removed.
- 'date':
- Contains two date formats (with and without milliseconds).
- Can be standardized and converted to datetime.
- Only one time zone (UTC).
- Not relevant for this analysis → can be removed.
- 'place':
- Inconsistently describes the location of the earthquake.
- Not relevant for this analysis → can be removed.
- 'status':
- Contains inconsistent classifications of the event (e.g., reviewed, automatic).
- Not relevant for this analysis → can be removed.
- 'tsunami':
- Binary information indicating whether the earthquake generated a tsunami.
- Can be converted to boolean.
- Recording of this flag in the dataset began in 2013.
- Not relevant for this analysis → can be removed.
- 'significance':
- Describes the impact level of the event.
- Limited information available about its calculation.
- Can be converted to uint16 or categorized (low, medium, high).
- 'data_type':
- Contains information about the event type (e.g., earthquake, quarry blast, explosion).
- Only 'earthquake' is relevant → other types can be filtered out.
- 'magnitudo':
- Contains the earthquake magnitude.
- Values are consistent with the Richter or moment magnitude scales.
- Can be converted to float16 to save memory.
- Magnitudes smaller than -3 are likely errors and should be excluded.
- 'state':
- Contains information about the state where the event occurred.
- Inconsistent naming (e.g., 'California', 'CA', 'CA earthquake').
- Not relevant for this analysis → can be removed.
- 'longitude':
- Contains the longitude of the event.
- No issues identified.
- Not relevant for this analysis → can be removed.
- 'latitude':
- Contains the latitude of the event.
- No issues identified.
- Not relevant for this analysis → can be removed.
- 'depth':
- Contains the depth of the event in kilometers.
- Depths smaller than -4 km are likely errors and should be excluded.
- Can be converted to float16 to save memory.
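The column-level recommendations above can be collected into one cleaning sketch. This is a minimal illustration using the thresholds listed in the summary; the function name `clean_earthquakes` is hypothetical, and the actual preprocessing is carried out in the Data preprocessing section:

```python
import pandas as pd

def clean_earthquakes(data: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps proposed in the initial exploration (sketch)."""
    df = data.drop_duplicates()
    # Keep genuine earthquakes only
    df = df[df['data_type'] == 'earthquake']
    # Drop values that are likely recording errors
    df = df[(df['magnitudo'] >= -3) & (df['depth'] >= -4)]
    # Remove columns not used further in this analysis
    df = df.drop(columns=['time', 'date', 'place', 'status', 'tsunami',
                          'data_type', 'state', 'longitude', 'latitude'])
    # Downcast to save memory
    df = df.astype({'magnitudo': 'float16', 'depth': 'float16',
                    'significance': 'uint16'})
    return df
```

The order matters slightly: duplicates and non-earthquake events are removed before the columns are dropped, so that the filters can still see `data_type`.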
# Define helper functions
def prettify_number(number: int | float) -> str:
    return "{:,}".format(number)

def print_column_info(column: pd.Series) -> None:
    data_type = column.dtype
    print(f'Basic information about "{column.name}" column:')
    print(f' - Data type: {data_type}')
    if data_type in ('int64', 'float64'):
        print(f" - (max; min): ({prettify_number(max(column))}; {prettify_number(min(column))})")
    unique_values = column.unique()
    print(f" - Number of missing values: {prettify_number(column.isnull().sum())}")
    print(f" - Number of unique values: {prettify_number(len(unique_values))}")
    print(f" - Percentage of unique values: {round(len(unique_values) / len(column) * 100, 2)}%")
    print(f" - Unique values (max 10):", end=" ")
    for i in range(min(10, len(unique_values))):
        if data_type in ('int64', 'float64'):
            print(prettify_number(round(unique_values[i], 4)), end=" | ")
        else:
            print(unique_values[i], end=" | ")

def print_counts(value_counts: pd.Series, n: int = 10) -> None:
    # Note: percentages are computed relative to the global `data` frame
    num_counts = len(value_counts)
    value_counts = value_counts.head(n)
    max_length_value = max(len(str(value)) for value in value_counts.index)
    for value, count in value_counts.items():
        num_spaces = max_length_value - len(str(value))
        print(f" -> {value}{' ' * num_spaces} | {prettify_number(count)} ({round(count / len(data) * 100, 2)}%)")
    if num_counts > n:
        print(f" -> ... and {prettify_number(num_counts - n)} more ({round((num_counts - n) / len(data) * 100, 2)}%)")
def plot_map_with_region(data: pd.DataFrame, region: tuple | None, title: str, zoom: bool, every_nth: int, size: int) -> None:
    fig, ax = plt.subplots(figsize=(10, 8))
    world = gpd.read_file("../data/maps/ne_110m_admin_0_countries.shp")
    world.plot(ax=ax, color='lightgrey')
    data = data.iloc[::every_nth, :]
    scatter = ax.scatter(data['longitude'], data['latitude'], s=size, alpha=1, c=data['magnitudo'], cmap="Reds")
    cbar = plt.colorbar(scatter, ax=ax, orientation='vertical', fraction=0.02)
    cbar.set_label('Magnitude')
    if region:
        left, right, bottom, top = region
        if zoom and left > right:
            print("The region crosses the 180° meridian. It cannot be zoomed.")
            zoom = False
        if zoom:
            ax.set_xlim([left, right])
            ax.set_ylim([bottom, top])
        else:
            if left > right:
                # The region wraps around the 180° meridian: draw the box in two parts
                ax.plot([left, left], [bottom, top], color='blue', linewidth=3)
                ax.plot([right, right], [bottom, top], color='blue', linewidth=3)
                ax.plot([left, 180], [bottom, bottom], color='blue', linewidth=3)
                ax.plot([left, 180], [top, top], color='blue', linewidth=3)
                ax.plot([-180, right], [top, top], color='blue', linewidth=3)
                ax.plot([-180, right], [bottom, bottom], color='blue', linewidth=3)
            else:
                ax.plot([left, right], [bottom, bottom], color='blue', linewidth=3)
                ax.plot([left, right], [top, top], color='blue', linewidth=3)
                ax.plot([left, left], [bottom, top], color='blue', linewidth=3)
                ax.plot([right, right], [bottom, top], color='blue', linewidth=3)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(title)
    plt.show()
# Load the dataset
path = '../data/earthquakes.csv'
data = pd.read_csv(path)

# Print the columns of the dataset
print("Columns of the raw dataset:")
for col in data.columns:
    print(" ", col)
Columns of the raw dataset:
  time
  place
  status
  tsunami
  significance
  data_type
  magnitudo
  state
  longitude
  latitude
  depth
  date
# Print the number of rows
original_number_of_rows = len(data)
print("\nNumber of rows in the dataset:", prettify_number(original_number_of_rows))
Number of rows in the dataset: 3,445,751
# Print the random 5 rows of the dataset
print("\n5 random rows of the dataset:")
print(data.sample(5))
5 random rows of the dataset:
time place status \
2229389 1421126144725 43 km E of Fort Bidwell, California reviewed
3041852 1606835699710 6 km ESE of La Parguera, Puerto Rico reviewed
1265195 1144479990381 18 km WNW of Fishhook, Alaska reviewed
3071099 1611985731343 40 km NNE of Petersville, Alaska reviewed
2677307 1534648527520 105 km SW of Adak, Alaska reviewed
tsunami significance data_type magnitudo state \
2229389 0 26 earthquake 1.30 California
3041852 0 48 earthquake 1.77 Puerto Rico
1265195 0 26 earthquake 1.30 Alaska
3071099 0 30 earthquake 1.40 Alaska
2677307 0 178 earthquake 3.40 Alaska
longitude latitude depth date
2229389 -119.6251 41.8799 6.50 2015-01-13 05:15:44.725000+00:00
3041852 -66.9901 17.9570 10.00 2020-12-01 15:14:59.710000+00:00
1265195 -149.5431 61.8207 43.50 2006-04-08 07:06:30.381000+00:00
3071099 -150.5944 62.8494 77.20 2021-01-30 05:48:51.343000+00:00
2677307 -177.6675 51.1732 29.84 2018-08-19 03:15:27.520000+00:00
# Get the duplicated rows indices
duplicated_rows_mask = data.duplicated(keep=False)
duplicated_rows_indices = data[duplicated_rows_mask].index
# Get the duplicated rows
duplicated_rows = data[duplicated_rows_mask]
# Check how many duplicated rows are there
num_duplicated_rows = duplicated_rows_mask.sum()
print(f"Number of duplicated rows: {prettify_number(num_duplicated_rows)} ({round(num_duplicated_rows / len(data) * 100, 2)}%)")
# Get unique rows in all duplicated rows
unique_rows = data.loc[duplicated_rows_indices].drop_duplicates()
print(f"Number of unique rows in the duplicated rows: {prettify_number(len(unique_rows))}")
print(f"Each duplicated row is on average {round(len(duplicated_rows) / len(unique_rows), 4)} times in the dataset.")
Number of duplicated rows: 33,734 (0.98%)
Number of unique rows in the duplicated rows: 16,865
Each duplicated row is on average 2.0002 times in the dataset.
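Since each duplicated row appears on average twice, keeping only the first occurrence removes roughly half of the affected rows. A sketch on a toy frame; on the real data the same `drop_duplicates()` call on `data` applies:

```python
import pandas as pd

# Toy frame with two duplicate groups
toy = pd.DataFrame({'a': [1, 1, 2, 3, 3], 'b': ['x', 'x', 'y', 'z', 'z']})
deduplicated = toy.drop_duplicates()  # keeps the first row of each duplicate group
n_removed = len(toy) - len(deduplicated)
print(n_removed)  # 2
```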
2.2.1. Time column
Summary:
- Timestamps are recorded in milliseconds (Unix timestamp).
- Median time difference between two consecutive events: 178.83 seconds.
- Maximum time difference between two consecutive events: 62.01 days.
- An increasing trend in the recorded number of earthquakes is observed. This trend likely reflects advancements in detection technology and an expanded sensor network rather than a genuine increase in earthquake frequency.
# Print the information about the time column
print_column_info(data['time'])
Basic information about "time" column:
- Data type: int64
- (max; min): (1,690,628,937,884; 631,153,353,990)
- Number of missing values: 0
- Number of unique values: 3,428,775
- Percentage of unique values: 99.51%
- Unique values (max 10): 631,153,353,990 | 631,153,491,210 | 631,154,083,450 | 631,155,512,130 | 631,155,824,490 | 631,155,853,760 | 631,156,030,570 | 631,156,431,950 | 631,156,614,070 | 631,156,620,210 |
# Sort the data by date
new_data = data.sort_values('time')
# Convert the time column to datetime
new_data['time'] = pd.to_datetime(new_data['time'], unit='ms')
# Average time difference between two consecutive events
time_diff = new_data['time'].diff().mean()
time_diff = time_diff.total_seconds()
print(f"Average time difference between two consecutive events: {round(time_diff, 2)} seconds")
# Median time difference between two consecutive events
time_diff = new_data['time'].diff().median()
time_diff = time_diff.total_seconds()
print(f"Median time difference between two consecutive events: {round(time_diff, 2)} seconds")
# Minimum time difference between two consecutive events
min_time_diff = new_data['time'].diff().min()
min_time_diff = min_time_diff.total_seconds()
print(f"Minimum time difference between two consecutive events: {min_time_diff} seconds")
# Maximum time difference between two consecutive events
max_time_diff = new_data['time'].diff().max()
max_time_diff = max_time_diff.total_seconds()
print(f"Maximum time difference between two consecutive events: {round(max_time_diff / 60 / 60 / 24, 2)} days")
Average time difference between two consecutive events: 307.47 seconds
Median time difference between two consecutive events: 178.83 seconds
Minimum time difference between two consecutive events: 0.0 seconds
Maximum time difference between two consecutive events: 62.01 days
The following graph illustrates the number of earthquakes recorded over time. Multiple peaks and valleys are visible in the data; however, these fluctuations are not analyzed further in this report. Additionally, an increasing trend in the recorded number of earthquakes is evident, likely reflecting advancements in detection technology and an expanded sensor network.
# Plot the histogram of the time column
plt.figure(figsize=(10, 5))
plt.hist(new_data['time'], bins=100, alpha=1, color='blue')
plt.title('Time Histogram')
plt.xlabel('Time')
plt.ylabel('Number of Earthquakes')
plt.grid()
plt.show()
2.2.2. Date column
Summary:
- The dataset contains two date formats: with and without milliseconds.
- All timestamps are in a single time zone (UTC).
- The 'date' and 'time' columns are consistent with each other.
- Conversion to datetime format is feasible.
# Print the information about the date column
print_column_info(data['date'])
Basic information about "date" column:
- Data type: object
- Number of missing values: 0
- Number of unique values: 3,428,775
- Percentage of unique values: 99.51%
- Unique values (max 10): 1990-01-01 00:22:33.990000+00:00 | 1990-01-01 00:24:51.210000+00:00 | 1990-01-01 00:34:43.450000+00:00 | 1990-01-01 00:58:32.130000+00:00 | 1990-01-01 01:03:44.490000+00:00 | 1990-01-01 01:04:13.760000+00:00 | 1990-01-01 01:07:10.570000+00:00 | 1990-01-01 01:13:51.950000+00:00 | 1990-01-01 01:16:54.070000+00:00 | 1990-01-01 01:17:00.210000+00:00 |
# Print unique time zones
time_zones = data['date'].apply(lambda x: x[-1])
print("Unique time zones:", end="")
for tz in time_zones.unique():
    if tz == '0':
        print(" UTC", end="")
    else:
        print(f" UTC{tz}", end="")
Unique time zones: UTC
# Print the two date formats found in the dataset
# Convert the date column to datetime
date_time = pd.to_datetime(data['date'], errors='coerce')
# Get the valid and non-valid dates
non_valid_dates = data[date_time.isnull()]['date']
valid_dates = data[date_time.notnull()]['date']
# Get the first occurrences of both date formats
non_valid_date = non_valid_dates.iloc[0]
valid_date = valid_dates.iloc[0]
print("Two date formats found in the dataset:")
print(f" - Date format without milliseconds: {non_valid_date} ({round(len(non_valid_dates) / len(data) * 100, 2)}% of the dataset)")
print(f" - Date format with milliseconds: {valid_date} ({round(len(valid_dates) / len(data) * 100, 2)}% of the dataset)")
Two date formats found in the dataset:
 - Date format without milliseconds: 1990-01-01 06:34:04+00:00 (1.36% of the dataset)
 - Date format with milliseconds: 1990-01-01 00:22:33.990000+00:00 (98.64% of the dataset)
# Compare the date and time columns
# Convert the date column to datetime
date_time = pd.to_datetime(data['date'], format='mixed')
# Get both to the same format
time_ms = data['time']
time_ms_from_date = date_time.apply(lambda x: x.timestamp() * 1000)
# Round the values
time_ms = time_ms.round(0)
time_ms_from_date = time_ms_from_date.round(0)
# Convert both to int
time_ms = time_ms.astype(int)
time_ms_from_date = time_ms_from_date.astype(int)
# Compare the two columns
diff = time_ms - time_ms_from_date
print(f"Number of different values between the 'time' and 'date' columns: {diff.astype(bool).sum()}")
Number of different values between the 'time' and 'date' columns: 0
2.2.3. Place column
Summary:
- Place names are inconsistent, with examples such as '12 km E of Mammoth Lakes, California' and 'Northern California'.
- The most frequently recorded place is 'California'.
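Despite the inconsistency, many 'place' strings end with a region after the final comma, so a rough region label could be recovered. This split rule is a hedged assumption: single-part values such as 'Northern California' simply map to themselves:

```python
import pandas as pd

# Example values taken from the unique-values listing above
places = pd.Series([
    '12 km E of Mammoth Lakes, California',
    'Northern California',
    '16km N of Fillmore, CA',
])

# Take the text after the last comma as a rough region label
regions = places.str.split(',').str[-1].str.strip()
print(regions.tolist())  # ['California', 'Northern California', 'CA']
```

Note that the result still mixes spellings ('California' vs. 'CA'), the same issue observed later in the 'state' column.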
# Print the information about the place column
print_column_info(data['place'])
Basic information about "place" column:
- Data type: object
- Number of missing values: 0
- Number of unique values: 531,130
- Percentage of unique values: 15.41%
- Unique values (max 10): 12 km NNW of Meadow Lakes, Alaska | 14 km S of Volcano, Hawaii | 7 km W of Cobb, California | 11 km E of Mammoth Lakes, California | 16km N of Fillmore, CA | 6 km WSW of Mammoth Lakes, California | 12 km E of Mammoth Lakes, California | 1 km W of Wilkeson, Washington | 9 km W of Cashmere, Washington | 6 km W of Cobb, California |
# Get only unique places and their counts
places = data['place'].value_counts()
# Sort the places by the number of occurrences
places = places.sort_values(ascending=False)
# Plot the top 10 places as histogram
plt.figure(figsize=(10, 5))
plt.bar(places.index[:10], places.values[:10])
plt.title('10 places with the most earthquakes')
plt.xlabel('Place')
plt.ylabel('Number of earthquakes')
plt.xticks(rotation=45, ha='right')
plt.show()
2.2.4. Status column
Summary:
- Contains 6 unique values, though the dataset provider indicates there should only be 2.
- Values such as 'reviewed' and 'automatic' are the most common.
- Merging of similar or synonymous values is possible.
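As the unique values suggest, the inconsistency is purely a matter of letter case, so the proposed merging reduces to lower-casing (a minimal sketch; 'manual' remains as a third category the dataset provider did not document):

```python
import pandas as pd

# The six raw values observed in the 'status' column
status = pd.Series(['reviewed', 'automatic', 'AUTOMATIC', 'REVIEWED', 'MANUAL', 'manual'])

# Lower-casing collapses the case variants into three canonical values
normalized = status.str.lower()
print(sorted(normalized.unique()))  # ['automatic', 'manual', 'reviewed']
```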
# Print the information about the status column
print_column_info(data['status'])
Basic information about "status" column:
- Data type: object
- Number of missing values: 0
- Number of unique values: 6
- Percentage of unique values: 0.0%
- Unique values (max 10): reviewed | automatic | AUTOMATIC | REVIEWED | MANUAL | manual |
# Check occurrences of each value
status_counts = data['status'].value_counts()
print("Occurrences of each value in the 'status' column:")
print_counts(status_counts)
Occurrences of each value in the 'status' column:
 -> reviewed  | 3,224,825 (93.59%)
 -> automatic | 205,714 (5.97%)
 -> REVIEWED  | 14,093 (0.41%)
 -> AUTOMATIC | 1,107 (0.03%)
 -> manual    | 8 (0.0%)
 -> MANUAL    | 4 (0.0%)
2.2.5. Tsunami column
Summary:
- Contains 2 unique values (0 and 1), which can be converted to boolean.
- The majority of earthquakes did not generate a tsunami.
- Tsunamis are predominantly associated with earthquakes in the Pacific Ring of Fire.
- Recording of tsunamis in the dataset began around the year 2013.
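Two of the points above translate directly into code (a small sketch on toy rows; the 2013-01-01 cut-off is an assumption derived from the observation that recording began around 2013):

```python
import pandas as pd

# Toy rows standing in for the dataset (illustrative values only)
toy = pd.DataFrame({
    'time': pd.to_datetime(['2005-06-01', '2014-03-15', '2020-12-01']),
    'tsunami': [0, 1, 0],
})

# The 0/1 flag converts directly to boolean
toy['tsunami'] = toy['tsunami'].astype(bool)

# Restrict tsunami-based analyses to the period in which the flag was recorded
recorded = toy[toy['time'] >= '2013-01-01']
print(toy['tsunami'].dtype, len(recorded))  # bool 2
```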
# Print the information about the tsunami column
print_column_info(data['tsunami'])
Basic information about "tsunami" column:
- Data type: int64
- (max; min): (1; 0)
- Number of missing values: 0
- Number of unique values: 2
- Percentage of unique values: 0.0%
- Unique values (max 10): 0 | 1 |
# Check occurrences of each value
tsunami_counts = data['tsunami'].value_counts()
print("Occurrences of each value in the 'tsunami' column:")
print_counts(tsunami_counts)
Occurrences of each value in the 'tsunami' column:
 -> 0 | 3,444,223 (99.96%)
 -> 1 | 1,528 (0.04%)
The following plot displays the coordinates of the events that generated a tsunami on a world map. The events are color-coded based on their magnitude, with larger magnitudes represented by darker shades of red. The map illustrates that tsunamis are predominantly associated with earthquakes in the Pacific Ring of Fire.
# Plot the coordinates of the events that generated a tsunami on a world map
tsunami_data = data[data['tsunami'] == 1]
plot_map_with_region(tsunami_data, None, "Earthquakes that generated a tsunami", zoom=False, every_nth=1, size=10)
The following histogram displays the number of earthquakes that generated a tsunami and those that did not over time. The x-axis represents the year, while the y-axis displays the number of earthquakes on a logarithmic scale. The histogram illustrates that the majority of earthquakes did not generate a tsunami. Additionally, the recording of tsunamis in the dataset began around the year 2013.
plt.figure(figsize=(10, 5))
times_with_tsunami = data[data['tsunami'] == 1]['time']
times_with_tsunami = pd.to_datetime(times_with_tsunami, unit='ms')
times_without_tsunami = data[data['tsunami'] == 0]['time']
times_without_tsunami = pd.to_datetime(times_without_tsunami, unit='ms')
# Plot the histograms
plt.hist(times_without_tsunami, bins=50, alpha=1, label='Non-tsunami', color='blue')
plt.hist(times_with_tsunami, bins=50, alpha=1, label='Tsunami', color='red')
# Format the x-axis ticks to display years
plt.gca().xaxis.set_major_locator(mdates.YearLocator()) # Major ticks every year
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y')) # Format ticks as years
plt.title('Time Histogram for the Tsunami and Non-tsunami Earthquakes [log]')
plt.xlabel('Year')
plt.ylabel('Number of Earthquakes [log]')
plt.yscale('log')
plt.legend()
plt.grid()
# show only every nth x-tick
ticks = plt.gca().get_xticks()
plt.gca().set_xticks(ticks[::5])
plt.xticks(rotation=45, ha='right')
plt.show()
# Print the first occurrence of an earthquake that generated a tsunami
first_tsunami = data[data['tsunami'] == 1].iloc[0]
print("First occurrence of an earthquake that generated a tsunami:")
print(f" - Date: {pd.to_datetime(first_tsunami['time'], unit='ms')}")
First occurrence of an earthquake that generated a tsunami:
 - Date: 2013-02-06 01:12:25.830000
2.2.6. Significance column
Summary:
- The range of values in the significance column spans from 0 to 2,910, making it possible to convert the column to uint16 to save memory.
- The column can also be converted into a categorical variable (low, medium, high).
- The distribution is highly right-skewed, indicating that most earthquakes have low significance.
- The two most significant events, with a significance of 2,910, are:
- 2017 Chiapas earthquake (https://en.wikipedia.org/wiki/2017_Chiapas_earthquake)
- 2023 Turkey–Syria earthquakes (https://en.wikipedia.org/wiki/2023_Turkey%E2%80%93Syria_earthquakes)
- The most significant earthquakes are concentrated in regions with high seismic activity and in populated areas.
- There is a peak around the value of 250 that is explained in section 2.3.2. Correlation between the magnitude and significance.
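The proposed conversion into a categorical variable can be sketched with `pd.cut`. The bin edges below (0 to 100, 100 to 500, above 500) are purely illustrative assumptions, not thresholds used elsewhere in the report:

```python
import pandas as pd

# Sample significance values (from the rows shown earlier in this section)
significance = pd.Series([26, 48, 178, 600, 2910])

# Hypothetical thresholds -- the actual cut points would be chosen in preprocessing
labels = pd.cut(significance,
                bins=[0, 100, 500, 3000],
                labels=['low', 'medium', 'high'],
                include_lowest=True)
print(labels.tolist())  # ['low', 'low', 'medium', 'high', 'high']
```

Because the distribution is so right-skewed, quantile-based edges (`pd.qcut`) would be an alternative worth considering to keep the classes balanced.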
# Print the information about the significance column
print_column_info(data['significance'])
Basic information about "significance" column:
- Data type: int64
- (max; min): (2,910; 0)
- Number of missing values: 0
- Number of unique values: 1,170
- Percentage of unique values: 0.03%
- Unique values (max 10): 96 | 31 | 19 | 15 | 134 | 118 | 20 | 11 | 39 | 74 |
The following histogram displays the significance of the earthquakes in the dataset. The x-axis represents the significance, while the y-axis displays the number of earthquakes on a logarithmic scale. The histogram illustrates that the distribution is highly right-skewed: most earthquakes have low significance. A smaller peak is also visible around the value of 250.
# Plot the log histogram of the significance column
plt.figure(figsize=(10, 5))
plt.hist(data['significance'], bins=100, alpha=1, color='blue')
plt.title('Significance Histogram [log]')
plt.xlabel('Significance')
plt.ylabel('Number of Earthquakes [log]')
plt.yscale('log')
plt.grid()
plt.show()
# Print the information about the most significant event
max_significance = max(data['significance'])
max_significance_event = data[data['significance'] == max_significance]
print(f"Most significant events:")
for event in max_significance_event.values:
    print()
    print(f" Event:")
    for i in range(len(event)):
        if isinstance(event[i], (int, float)):
            print(f" - {data.columns[i]}: {prettify_number(round(event[i], 4))}")
        else:
            print(f" - {data.columns[i]}: {event[i]}")

url1 = "https://en.wikipedia.org/wiki/2017_Chiapas_earthquake"
url2 = "https://en.wikipedia.org/wiki/2023_Turkey%E2%80%93Syria_earthquakes"
print()
print(f" More information:")
print(f" - 1.: {url1}")
print(f" - 2.: {url2}")
Most significant events:
Event:
- time: 1,504,846,159,180
- place: near the coast of Chiapas, Mexico
- status: reviewed
- tsunami: 1
- significance: 2,910
- data_type: earthquake
- magnitudo: 8.2
- state: Mexico
- longitude: -93.8993
- latitude: 15.0222
- depth: 47.39
- date: 2017-09-08 04:49:19.180000+00:00
Event:
- time: 1,675,646,254,342
- place: Pazarcik earthquake, Kahramanmaras earthquake sequence
- status: reviewed
- tsunami: 0
- significance: 2,910
- data_type: earthquake
- magnitudo: 7.8
- state: Kahramanmaras earthquake sequence
- longitude: 37.0143
- latitude: 37.2256
- depth: 10.0
- date: 2023-02-06 01:17:34.342000+00:00
More information:
- 1.: https://en.wikipedia.org/wiki/2017_Chiapas_earthquake
- 2.: https://en.wikipedia.org/wiki/2023_Turkey%E2%80%93Syria_earthquakes
The following map displays the 100 most significant earthquakes in the dataset. The events are color-coded based on their magnitude, with larger magnitudes represented by darker shades of red. The distribution shows that the most significant earthquakes are concentrated in regions with high seismic activity and in populated areas.
# Plot 100 most significant events on a world map
most_significant_data = data.nlargest(100, 'significance')
plot_map_with_region(most_significant_data, None, "100 most significant earthquakes", zoom=False, every_nth=1, size=10)
2.2.7. Data type column
Summary:
- Contains 25 unique values, with the most common being 'earthquake', 'quarry blast', and 'explosion'.
- Earthquakes are by far the most common type of event in the dataset.
# Print the information about the data_type column
print_column_info(data['data_type'])
Basic information about "data_type" column:
- Data type: object
- Number of missing values: 0
- Number of unique values: 25
- Percentage of unique values: 0.0%
- Unique values (max 10): earthquake | quarry blast | explosion | other event | nuclear explosion | rock burst | ice quake | chemical explosion | sonic boom | mine collapse |
# Check occurrences of each value
data_type_counts = data['data_type'].value_counts()
print("Occurrences of each value in the 'data_type' column:")
print_counts(data_type_counts, n=10)
Occurrences of each value in the 'data_type' column:
 -> earthquake         | 3,361,846 (97.56%)
 -> quarry blast       | 38,865 (1.13%)
 -> explosion          | 26,571 (0.77%)
 -> ice quake          | 13,838 (0.4%)
 -> mining explosion   | 2,177 (0.06%)
 -> other event        | 1,706 (0.05%)
 -> chemical explosion | 314 (0.01%)
 -> rock burst         | 182 (0.01%)
 -> sonic boom         | 104 (0.0%)
 -> nuclear explosion  | 56 (0.0%)
 -> ... and 15 more (0.0%)
2.2.8. Magnitude column
Summary:
- The range of values in the magnitude column spans from -9.99 to 9.1, allowing for conversion to float16 to save memory.
- Earthquakes with magnitudes smaller than -3 are likely errors and should be excluded.
- The distribution of the magnitude column shows two peaks:
- A primary peak around 1.25.
- A secondary peak around 4.5.
- This bimodal distribution is further explored in section 2.3.1. Analysis of the magnitude values in chosen regions.
- The largest magnitude in the dataset is 9.1, shared by the 2004 Sumatra-Andaman earthquake and the 2011 Tōhoku earthquake and tsunami. (https://en.wikipedia.org/wiki/2011_T%C5%8Dhoku_earthquake_and_tsunami)
- Although the dataset description indicates that both the Richter and moment magnitude scales are used, this should not pose a problem: the two scales yield similar values for the same earthquakes, at least for smaller ones. The discrepancy typically becomes more noticeable for larger earthquakes, but for most of the dataset's entries the values should be consistent.
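The memory saving from the proposed float16 downcast is easy to verify (a sketch with synthetic values; note that float16 keeps only about three significant digits, so magnitudes reported to two decimals are stored approximately):

```python
import numpy as np
import pandas as pd

# Synthetic magnitudes spanning the plausible range of the column
mag64 = pd.Series(np.linspace(-3, 9.1, num=100_000))
mag16 = mag64.astype('float16')

# float16 uses 2 bytes per value instead of 8, a 4x reduction
print(f"{mag64.memory_usage(index=False)} bytes -> "
      f"{mag16.memory_usage(index=False)} bytes")  # 800000 bytes -> 200000 bytes
```

For roughly 3.4 million rows, this is on the order of 20 MB saved per downcast column.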
# Print the information about the magnitude column
print_column_info(data['magnitudo'])
Basic information about "magnitudo" column:
- Data type: float64
- (max; min): (9.1; -9.99)
- Number of missing values: 0
- Number of unique values: 933
- Percentage of unique values: 0.03%
- Unique values (max 10): 2.5 | 1.41 | 1.11 | 0.98 | 2.95 | 2.77 | 1.13 | 0.83 | 1.59 | 2.2 |
The following histograms display the magnitude distribution in the dataset. The first histogram is displayed on a linear scale, while the second is displayed on a logarithmic scale. The distribution shows two peaks: a primary peak around 1.25 and a secondary peak around 4.5. Also, on the logarithmic scale, there are clearly visible outliers with magnitudes smaller than -3, which are likely errors and should be excluded. This distribution requires further investigation.
# Plot two histograms of the magnitude column (linear and log scale) next to each other
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
# Linear scale
axs[0].hist(data['magnitudo'], bins=100, alpha=1, color='blue')
axs[0].set_title('Magnitude Histogram [linear]')
axs[0].set_xlabel('Magnitude')
axs[0].set_ylabel('Number of Earthquakes')
axs[0].grid()
# Log scale
axs[1].hist(data['magnitudo'], bins=100, alpha=1, color='blue')
axs[1].set_title('Magnitude Histogram [log]')
axs[1].set_xlabel('Magnitude')
axs[1].set_ylabel('Number of Earthquakes [log]')
axs[1].set_yscale('log')
axs[1].grid()
plt.show()
# Print the earthquake with the largest magnitude
max_magnitude = max(data['magnitudo'])
max_magnitude_event = data[data['magnitudo'] == max_magnitude]
print(f"Earthquake with the largest magnitude:")
for event in max_magnitude_event.values:
    print()
    print(f" Event:")
    for i in range(len(event)):
        if isinstance(event[i], (int, float)):
            print(f" - {data.columns[i]}: {prettify_number(round(event[i], 4))}")
        else:
            print(f" - {data.columns[i]}: {event[i]}")
Earthquake with the largest magnitude:
Event:
- time: 1,104,022,733,450
- place: 2004 Sumatra - Andaman Islands Earthquake
- status: reviewed
- tsunami: 0
- significance: 1,274
- data_type: earthquake
- magnitudo: 9.1
- state: 2004 Sumatra - Andaman Islands Earthquake
- longitude: 95.982
- latitude: 3.295
- depth: 30.0
- date: 2004-12-26 00:58:53.450000+00:00
Event:
- time: 1,299,822,384,120
- place: 2011 Great Tohoku Earthquake, Japan
- status: reviewed
- tsunami: 0
- significance: 2,184
- data_type: earthquake
- magnitudo: 9.1
- state: Japan
- longitude: 142.373
- latitude: 38.297
- depth: 29.0
- date: 2011-03-11 05:46:24.120000+00:00
2.2.9. State column¶
Summary:
- Contains inconsistent state names (e.g., 'California', 'CA', 'CA earthquake'); the value counts below even list 'California' twice, which suggests variants that differ only in surrounding whitespace.
- The most common state recorded is 'California'.
# Print the information about the state column
print_column_info(data['state'])
Basic information about "state" column:
- Data type: object
- Number of missing values: 0
- Number of unique values: 858
- Percentage of unique values: 0.02%
- Unique values (max 10): Alaska | Hawaii | California | California | Washington | Greece | Italy | Albania | Utah | Fiji region |
# Check occurrences of each value
state_counts = data['state'].value_counts()
print("Occurrences of each value in the 'state' column:")
print_counts(state_counts, n=10)
Occurrences of each value in the 'state' column:
 -> California | 866,675 (25.15%)
 -> Alaska | 777,881 (22.58%)
 -> California | 491,689 (14.27%)
 -> Nevada | 174,065 (5.05%)
 -> Hawaii | 125,336 (3.64%)
 -> Washington | 80,483 (2.34%)
 -> Utah | 55,471 (1.61%)
 -> Montana | 53,426 (1.55%)
 -> Indonesia | 48,502 (1.41%)
 -> Puerto Rico | 46,275 (1.34%)
 -> ... and 848 more (0.02%)
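The duplicated 'California' entry above hints at name variants that differ only in surrounding whitespace. A hedged sketch of a normalization step, on invented values that mimic this pattern (the real cause of the duplicate would need to be verified against the raw strings):

```python
import pandas as pd

# Invented sample imitating the duplicated 'California' entries
states = pd.Series(['California', ' California', 'Alaska', 'California '])

# Strip surrounding whitespace so the variants collapse into a single key
counts = states.str.strip().value_counts()
print(counts)  # 'California': 3, 'Alaska': 1
```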
2.2.10. Longitude column¶
Summary:
- No issues or noteworthy observations. The data appears as expected.
# Print the information about the longitude column
print_column_info(data['longitude'])
Basic information about "longitude" column:
- Data type: float64
- (max; min): (180.0; -179.9997)
- Number of missing values: 0
- Number of unique values: 733,599
- Percentage of unique values: 21.29%
- Unique values (max 10): -149.6692 | -155.2123 | -122.8062 | -118.8463 | -118.934 | -118.923 | -119.04 | -118.8342 | -118.8402 | -122.0712 |
2.2.11. Latitude column¶
Summary:
- No issues or noteworthy observations. The data appears as expected.
# Print the information about the latitude column
print_column_info(data['latitude'])
Basic information about "latitude" column:
- Data type: float64
- (max; min): (87.386; -84.422)
- Number of missing values: 0
- Number of unique values: 518,295
- Percentage of unique values: 15.04%
- Unique values (max 10): 61.7302 | 19.3177 | 38.821 | 37.6643 | 34.546 | 34.543 | 37.6327 | 37.6615 | 37.6623 | 47.1025 |
The following map displays the coordinates of the events in the dataset. The events are color-coded based on their magnitude, with larger magnitudes represented by darker shades of red.
# Plot the coordinates of the events on a world map
plot_map_with_region(data, None, "Earthquakes", zoom=False, every_nth=1, size=2)
2.2.12. Depth column¶
Summary:
- The depth is measured in kilometers, based on the highest value of 735.8 km, which is consistent with the deepest recorded earthquakes (see https://en.wikipedia.org/wiki/Deep-focus_earthquake).
- The depth column spans from -10 to 735.8 km, allowing conversion to int16 to save memory.
- The distribution is strongly right-skewed, indicating that most earthquakes occur at shallow depths.
- The negative depth values likely correspond to hypocenters located above sea level.
- Earthquakes with depths smaller than -4 km are most likely errors and should be excluded.
# Print the information about the depth column
print_column_info(data['depth'])
Basic information about "depth" column:
- Data type: float64
- (max; min): (735.8; -10.0)
- Number of missing values: 0
- Number of unique values: 78,386
- Percentage of unique values: 2.27%
- Unique values (max 10): 30.1 | 6.585 | 3.22 | -0.584 | 16.122 | 16.342 | -1.499 | 0.556 | -0.234 | 8.382 |
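The int16 suggestion from the summary can be sanity-checked: the observed range of -10 to 735.8 km fits comfortably within int16, and the cast would cut per-element storage from 8 bytes to 2 (at the cost of truncating the fractional kilometers):

```python
import numpy as np

# int16 spans -32,768..32,767, which easily covers depths of -10..736 km
info = np.iinfo(np.int16)
print(info.min, info.max)  # -32768 32767

# Per-element storage: float64 uses 8 bytes, int16 only 2 (a 4x saving)
print(np.dtype(np.float64).itemsize, np.dtype(np.int16).itemsize)  # 8 2
```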
The following histograms display the depth distribution in the dataset. The first histogram is displayed on a linear scale, while the second is displayed on a logarithmic scale. The distribution clearly shows that most earthquakes have low depth.
# Plot two histograms of the depth column (linear and log scale) next to each other
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
# Linear scale
axs[0].hist(data['depth'], bins=100, alpha=1, color='blue')
axs[0].set_title('Depth Histogram [linear]')
axs[0].set_xlabel('Depth')
axs[0].set_ylabel('Number of Earthquakes')
axs[0].grid()
# Log scale
axs[1].hist(data['depth'], bins=100, alpha=1, color='blue')
axs[1].set_title('Depth Histogram [log]')
axs[1].set_xlabel('Depth')
axs[1].set_ylabel('Number of Earthquakes [log]')
axs[1].set_yscale('log')
axs[1].grid()
plt.show()
The following histogram displays the depth distribution in the dataset. The histogram includes only the earthquakes with depths smaller than 0 km and shows some outliers with depths smaller than -4 km. These outliers are likely errors and should be excluded.
# Print the depth histogram with the earthquakes with depth smaller than 0 km
plt.figure(figsize=(10, 5))
plt.hist(data[data['depth'] < 0]['depth'], bins=100, alpha=1, color='blue')
plt.title('Depth Histogram with earthquakes with depth < 0 km [log]')
plt.xlabel('Depth')
plt.ylabel('Number of Earthquakes [log]')
plt.grid()
plt.yscale('log')
plt.show()
The following map displays the coordinates of the events with depths larger than 500 km.
# Plot the coordinates of the events with the depths larger than 500 km on a world map
deep_data = data[data['depth'] > 500]
plot_map_with_region(deep_data, None, "Earthquakes with depth > 500 km", zoom=False, every_nth=1, size=2)
The following map displays the coordinates of the events with depths smaller than -4 km (outliers). It shows that these events occur only in a specific region of Alaska.
# Plot the coordinates of the events with the depths smaller than -4 km on a world map
shallow_data = data[data['depth'] < -4]
plot_map_with_region(shallow_data, None, "Earthquakes with depth < -4 km", zoom=False, every_nth=1, size=10)
2.3. Advanced dataset exploration¶
This section explores the columns selected in the previous sections that are most relevant to this work. The analysis focuses on the magnitude, significance, and depth columns.
Summary:
- It is reasonable to filter out earthquakes with magnitudes smaller than 0.
- The double peak distribution of the magnitudes can be explained by a theory based on insights extracted from the data.
- The magnitude values, reinforced by the depth values, can be used as features in a machine learning model to predict the significance level of an earthquake.
# Load the dataset
path = '../data/earthquakes_prepared_old.parquet'
data = pd.read_parquet(path)
# Define helper functions
def get_region_data(data: pd.DataFrame, region: tuple) -> pd.DataFrame:
left, right, bottom, top = region
if left < right:
region_data = data[(data['longitude'] >= left) & (data['longitude'] <= right) & (data['latitude'] >= bottom) & (data['latitude'] <= top)]
else:
region_data = data[((data['longitude'] <= right) | (data['longitude'] >= left)) & (data['latitude'] >= bottom) & (data['latitude'] <= top)]
return region_data
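The `left > right` branch above handles regions that cross the antimeridian, such as Alaska. The check below restates the same logic in self-contained form and exercises it with invented coordinates:

```python
import pandas as pd

def get_region_data(data, region):
    # Same logic as above: when left > right, the box wraps the antimeridian
    left, right, bottom, top = region
    lat_ok = (data['latitude'] >= bottom) & (data['latitude'] <= top)
    if left < right:
        lon_ok = (data['longitude'] >= left) & (data['longitude'] <= right)
    else:
        lon_ok = (data['longitude'] >= left) | (data['longitude'] <= right)
    return data[lon_ok & lat_ok]

# Invented events on opposite sides of the 180° meridian, plus one far outside
events = pd.DataFrame({'longitude': [175.0, -150.0, 0.0],
                       'latitude': [60.0, 60.0, 60.0]})

# Alaska region from the analysis below: left=172 > right=-129, so it wraps
alaska = get_region_data(events, (172, -129, 51, 72))
print(len(alaska))  # 2: both sides of the meridian match; the event at 0° does not
```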
# Define a function to plot the histograms for each region
def plot_graphs_for_regions(data: pd.DataFrame, regions: dict, column: str, xlabel: str, log_scale: bool = True,
fig_columns: int = 3, bins: int = 100) -> tuple[dict, dict]:
max_values = data[column].max()
min_values = data[column].min()
max_values = max_values + 1 if column != 'date' else max_values + timedelta(days=1)
min_values = min_values - 1 if column != 'date' else min_values - timedelta(days=1)
fig_rows = -(-len(regions) // fig_columns)  # ceiling division: enough rows for all regions
fig, axs = plt.subplots(fig_rows, fig_columns, figsize=(15, 5 * fig_rows))
averages = []
medians = []
max_average = -np.inf
min_average = np.inf
max_median = -np.inf
min_median = np.inf
max_average_region = ''
min_average_region = ''
max_median_region = ''
min_median_region = ''
for i, (region, coords) in enumerate(regions.items()):
row = i // fig_columns
col = i % fig_columns
region_data = get_region_data(data, coords)
axs[row, col].hist(region_data[column], bins=bins)
axs[row, col].set_title(region)
axs[row, col].set_xlabel(xlabel)
if log_scale:
axs[row, col].set_yscale('log')
axs[row, col].set_ylabel('Frequency [log]')
else:
axs[row, col].set_ylabel('Frequency')
if column != 'date':
average = region_data[column].astype('float64').mean()
axs[row, col].axvline(average, color='red', linestyle='--', label=f'Average: {average:.2f}', linewidth=2)
averages.append(average)
median = region_data[column].astype('float64').median()
axs[row, col].axvline(median, color='green', linestyle='--', label=f'Median: {median:.2f}')
medians.append(median)
else:
max_3_earthquakes = region_data.nlargest(3, 'magnitudo')
max_magnitude_date_first = max_3_earthquakes.iloc[0]['date']
max_magnitude_date_second = max_3_earthquakes.iloc[1]['date']
max_magnitude_date_third = max_3_earthquakes.iloc[2]['date']
axs[row, col].axvline(max_magnitude_date_first, color='red', linestyle='--', label=f'MAX: 1st', linewidth=2)
axs[row, col].axvline(max_magnitude_date_second, color='green', linestyle='--', label=f'MAX: 2nd', linewidth=2)
axs[row, col].axvline(max_magnitude_date_third, color='blue', linestyle='--', label=f'MAX: 3rd', linewidth=2)
axs[row, col].grid()
axs[row, col].legend()
axs[row, col].set_xlim([min_values, max_values])
if column != 'date' and average > max_average:
max_average = average
max_average_region = region
if column != 'date' and average < min_average:
min_average = average
min_average_region = region
if column != 'date' and median > max_median:
max_median = median
max_median_region = region
if column != 'date' and median < min_median:
min_median = median
min_median_region = region
plt.show()
average_info = {
'all': averages,
'max': max_average,
'min': min_average,
'max_region': max_average_region,
'min_region': min_average_region
}
median_info = {
'all': medians,
'max': max_median,
'min': min_median,
'max_region': max_median_region,
'min_region': min_median_region
}
return average_info, median_info
2.3.1. Analysis of the magnitude values in chosen regions¶
Earthquake attributes are heavily influenced by the region in which they occur, as different regions present varying environmental conditions such as tectonic plate boundaries and seismic activity. It is therefore reasonable to expect that earthquakes in different regions will exhibit different characteristics. In this section, the main goal is to analyze the magnitude values in different regions to identify potential patterns or discrepancies. The regions are defined by a rectangle with the following coordinates: (left, right, bottom, top) = (longitude, longitude, latitude, latitude).
Summary:
- Based on the outcome graphs, it seems advisable to filter out earthquakes with magnitudes smaller than 0.
- The double peak distribution of the magnitudes observed in section 2.2.8. Magnitude column may be explained by the theory that different regions register earthquakes with varying sensitivities to seismic activity.
- The outlier magnitude values observed in section 2.2.8. Magnitude column are likely placeholder values for invalid or negligible magnitudes. These placeholder values seem to be different for each region, making it challenging to accurately filter them out without detailed regional metadata.
# Define the regions
regions = {
'Alaska': (172, -129, 51, 72),
'California': (-124.7, -113.9, 32.3, 42.2),
'USA': (-124.736342, -66.945392, 24.521208, 49.382808),
'Chile': (-81.5, -61.5, -56.5, -17.4),
'South America': (-85, -35, -60, 15),
'Mexico': (-118.5, -86.3, 14.2, 33.3),
'Japan': (121.96, 151.58, 22.95, 47.56),
'Indonesia': (94.5, 141.5, -11.5, 6.5),
'Europe': (-10, 30, 35, 70),
'Himalayas': (70, 90, 25, 40),
'New Zealand': (165.9, 179.1, -47.8, -33.9),
'Hawaii': (-160.7, -154.6, 18.7, 22.4),
'South Pacific': (160, -100, -60, 0),
'South-West Asia': (75, 140, 25, 50),
'Africa': (-20, 55, -35, 37),
'Middle East': (25, 60, 10, 40),
'Atlantic': (-50, -10, -65, 60),
'Pacific Ring of Fire': (120, -70, -60, 60)
}
# Plot example region
print("Example region - Atlantic:")
plot_map_with_region(data, regions['Atlantic'], "Atlantic", zoom=False, every_nth=1, size=2)
Example region - Atlantic:
Based on the following graphs, it appears that different regions use distinct placeholder values for the magnitude of earthquakes that are considered invalid or negligible. For example, in the USA, this value is set at -9.99, while in Chile, it is set at 0. Both values fall outside the main range of valid magnitudes. This discrepancy accounts for the outliers with magnitude values smaller than -3 that were identified earlier in section 2.2.8. Magnitude column.
To improve the dataset, it is advisable to remove these placeholder events. However, this task is complicated by the fact that different regions use different placeholder values. For instance, the main range of valid magnitudes for California overlaps with the placeholder value used in Chile. Without detailed regional metadata, it is impossible to accurately determine the appropriate threshold for filtering.
Given this limitation, the most practical solution is to exclude all events with magnitude values smaller than 0, as these correspond to earthquakes with negligible magnitudes.
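The proposed cutoff is a one-line boolean mask. A sketch on invented values (the notebook applies the same idea to the full dataset during cleaning):

```python
import pandas as pd

# Invented magnitudes, including placeholder-like values
df = pd.DataFrame({'magnitudo': [1.2, 4.6, -9.99, -0.5, 2.0]})

# Keep only events with non-negative magnitude, as proposed above
filtered = df[df['magnitudo'] >= 0]
print(f"Kept {len(filtered)} of {len(df)} events")  # 3 of 5
```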
region_magnitudo_averages, region_magnitudo_medians = plot_graphs_for_regions(data, regions, 'magnitudo', 'Magnitude', log_scale=True, bins=50)
# Print maximum average and median magnitudes
print(f"Maximum average magnitude: {region_magnitudo_averages['max']:.2f} for {region_magnitudo_averages['max_region']}")
print(f"Maximum median magnitude: {region_magnitudo_medians['max']:.2f} for {region_magnitudo_medians['max_region']}")
# Print minimum average and median magnitudes
print(f"Minimum average magnitude: {region_magnitudo_averages['min']:.2f} for {region_magnitudo_averages['min_region']}")
print(f"Minimum median magnitude: {region_magnitudo_medians['min']:.2f} for {region_magnitudo_medians['min_region']}")
Maximum average magnitude: 4.55 for Atlantic
Maximum median magnitude: 4.60 for Atlantic
Minimum average magnitude: 1.12 for California
Minimum median magnitude: 1.08 for California
The following graph shows the histograms of the average and median magnitudes for different regions. There are two main peaks in the distribution of the magnitudes: one around the value 1.25 and the other around the value 4.5. This corresponds to the two peaks observed in the overall magnitude distribution in section 2.2.8. Magnitude column. The two peaks in the overall distribution are therefore likely related to the fact that different regions register earthquakes with different sensitivities.
# Plot histograms for the averages and medians
plt.hist(region_magnitudo_averages['all'], bins=10, color='blue', alpha=0.5, label='Average')
plt.hist(region_magnitudo_medians['all'], bins=10, color='red', alpha=0.5, label='Median')
plt.title('Histogram of the average and median magnitudes for different regions')
plt.xlabel('Magnitude')
plt.ylabel('Frequency')
plt.legend()
plt.grid()
plt.show()
2.3.2. Correlation between the significance and magnitude values¶
Summary:
- The correlation between the magnitude and significance values is 0.94, indicating a strong positive correlation.
- Based on this strong correlation, it is reasonable to assume that the magnitude value can be used to predict the significance of an earthquake.
- The peak in the significance distribution around the value of 250, observed in section 2.2.6 Significance Column, can be explained by the correlation with the magnitude values. As noted in the previous section, the magnitude distribution contains a secondary peak, and this relationship likely accounts for the observed peak in significance.
# Calculate the correlation between magnitude and significance
magnitude_significance_correlation = data['magnitudo'].corr(data['significance'])
print(f"Correlation between magnitude and significance: {magnitude_significance_correlation:.2f}")
Correlation between magnitude and significance: 0.94
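For reference, `Series.corr` computes the Pearson coefficient by default, i.e. cov(x, y) / (σx · σy). A quick sanity check on invented, perfectly linear data, where the coefficient must come out as 1:

```python
import numpy as np
import pandas as pd

# Invented, perfectly linear data: the Pearson coefficient must be 1
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 2

r = x.corr(y)  # pandas default method is Pearson
r_manual = np.cov(x, y)[0, 1] / (x.std() * y.std())  # cov(x, y) / (sx * sy)
print(round(r, 6), round(r_manual, 6))  # 1.0 1.0
```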
The following scatter plot displays the relationship between the magnitude and significance values. The plot shows a strong positive correlation between the two variables, with a correlation coefficient of 0.94.
# Plot the scatter plot between magnitude and significance
plt.scatter(data['magnitudo'], data['significance'], s=1)
plt.title('Magnitude vs Significance')
plt.xlabel('Magnitude')
plt.ylabel('Significance')
plt.grid()
plt.show()
The following scatter plot displays the relationship between the magnitude and significance values on a logarithmic scale. The plot shows a strong positive correlation between the two variables, with a correlation coefficient of 0.94.
# Plot the scatter plot between magnitude and significance with log scale
plt.scatter(data['magnitudo'], data['significance'], s=1)
plt.title('Magnitude vs Significance [Log]')
plt.xlabel('Magnitude [Log]')
plt.ylabel('Significance [Log]')
plt.xscale('log')
plt.yscale('log')
plt.grid()
plt.show()
2.3.3. Correlation between the significance and depth values¶
Summary:
- The correlation between the depth and significance values is 0.36, indicating a weak positive correlation.
- Based on this correlation, it is reasonable to assume that the depth value is not as strong a predictor for the significance of an earthquake as the magnitude value. However, it still retains some predictive power.
# Calculate the correlation between depth and significance
depth_significance_correlation = data['depth'].corr(data['significance'])
print(f"Correlation between depth and significance: {depth_significance_correlation:.2f}")
Correlation between depth and significance: 0.36
The following scatter plot displays the relationship between the depth and significance values. The plot shows a weak positive correlation between the two variables, with a correlation coefficient of 0.36.
# Plot the scatter plot between depth and significance
plt.scatter(data['depth'], data['significance'], s=1)
plt.title('Depth vs Significance')
plt.xlabel('Depth')
plt.ylabel('Significance')
plt.grid()
plt.show()
The following scatter plot displays the relationship between the depth and significance values on a logarithmic scale. The plot shows a weak positive correlation between the two variables, with a correlation coefficient of 0.36.
# Plot the scatter plot between depth and significance with log scale
plt.scatter(data['depth'], data['significance'], s=1)
plt.title('Depth vs Significance [Log]')
plt.xlabel('Depth [Log]')
plt.ylabel('Significance [Log]')
plt.xscale('log')
plt.yscale('log')
plt.grid()
plt.show()
3. Data preprocessing¶
This section covers the data cleaning and transformation steps necessary to prepare the dataset for model training. The following steps will be performed:
- Data cleaning
- Data transformation
3.1. Data cleaning¶
Several data cleaning steps are necessary to prepare the dataset. The following steps will be performed:
- Remove duplicated rows.
- Remove the time, state, place, and status columns.
- Keep only rows with 'earthquake' in the data_type column. Then remove the data_type column.
- Remove newly created duplicates.
- Remove the date, tsunami, longitude, and latitude columns.
- Remove rows with magnitude values smaller than 0.
- Remove rows with depth values smaller than -4 km.
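The prepared parquet file loaded below already has these steps applied offline. As a rough sketch only (column names from the dataset, all values invented), the listed steps might look like:

```python
import pandas as pd

# Toy frame with the dataset's columns (invented rows; row 0 is duplicated,
# row 2 is not an earthquake, row 3 has a placeholder magnitude,
# row 4 has an implausible depth)
raw = pd.DataFrame({
    'time': [1, 1, 2, 3, 4],
    'place': ['a', 'a', 'b', 'c', 'd'],
    'status': ['reviewed'] * 5,
    'tsunami': [0, 0, 0, 0, 0],
    'significance': [20, 20, 45, 10, 30],
    'data_type': ['earthquake', 'earthquake', 'explosion', 'earthquake', 'earthquake'],
    'magnitudo': [1.2, 1.2, 2.0, -9.99, 1.5],
    'state': ['CA', 'CA', 'NV', 'AK', 'HI'],
    'longitude': [0.0] * 5,
    'latitude': [0.0] * 5,
    'depth': [5.0, 5.0, 3.0, 2.0, -6.0],
    'date': pd.to_datetime([0, 0, 1, 2, 3]),
})

# Apply the listed cleaning steps in order
cleaned = raw.drop_duplicates()
cleaned = cleaned.drop(columns=['time', 'state', 'place', 'status'])
cleaned = cleaned[cleaned['data_type'] == 'earthquake'].drop(columns=['data_type'])
cleaned = cleaned.drop_duplicates()
cleaned = cleaned.drop(columns=['date', 'tsunami', 'longitude', 'latitude'])
cleaned = cleaned[(cleaned['magnitudo'] >= 0) & (cleaned['depth'] >= -4)]
print(list(cleaned.columns))  # ['significance', 'magnitudo', 'depth']
```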
# Load the dataset
path = '../data/earthquakes_prepared.parquet'
data = pd.read_parquet(path)
print("Columns after cleaning:")
for col in data.columns:
print(" ", col)
Columns after cleaning:
  significance
  magnitudo
  depth
# Print the number of rows
data_percentage = len(data) / original_number_of_rows * 100
print(f"\nNumber of rows after cleaning: {prettify_number(len(data))} ({data_percentage:.2f}% of the original dataset)")
Number of rows after cleaning: 3,257,048 (94.52% of the original dataset)
# Print the random 5 rows of the dataset
print("\n5 random rows after cleaning:")
print(data.sample(5))
5 random rows after cleaning:
significance magnitudo depth
50792 20 1.139648 12.031250
2073872 22 1.200195 5.699219
3223058 19 1.099609 69.312500
796313 45 1.709961 8.539062
1677548 4 0.479980 0.793945
3.2. Data transformation¶
The following data transformations will be performed:
- Significance will be converted to a categorical column with values (0, 1, 2) representing (low, medium, high). The boundaries for categorization are set manually at 250 and 750.
- The magnitude and depth columns will be normalized and converted to float16 to save memory.
# Convert the significance column to a categorical column
# Replace the significance column with 0 for 'low', 1 for 'medium', and 2 for 'high' significance
data['original_significance'] = data['significance']
# Get boundaries for each significance level
max_significance = data['significance'].max()
min_significance = data['significance'].min()
boundaries = [min_significance, 250, 750, max_significance]
# Split the significance values into 3 categories
data['significance'] = pd.cut(data['significance'], bins=boundaries, labels=[0, 1, 2], include_lowest=True)
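Note that `pd.cut` uses right-inclusive bins by default, so values exactly on a boundary fall into the lower category: a significance of exactly 250 becomes label 0 (low) and exactly 750 becomes label 1 (medium). A quick check with the same boundaries (0 and 3,000 are invented stand-ins for the dataset's min and max):

```python
import pandas as pd

# Same 250/750 boundaries as above, with invented min/max endpoints
boundaries = [0, 250, 750, 3000]
labels = pd.cut(pd.Series([250, 251, 750, 751]),
                bins=boundaries, labels=[0, 1, 2], include_lowest=True)
print(list(labels))  # [0, 1, 1, 2]: boundary values fall into the lower bin
```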
The following histogram displays the original significance values colored by the new significance categories. The plot shows that the new significance categories are based on the original significance values, with the boundaries set at 250 and 750.
# Plot the histogram of the original significance colored by the new significance
fig, ax = plt.subplots()
for i in range(3):
plt.hist(data[data['significance'] == i]['original_significance'], bins=50, alpha=0.5, label=f'New Significance: {i}')
plt.yscale('log')
plt.title('Original Significance Histogram - Colored by New Significance categories')
plt.xlabel('Significance')
plt.ylabel('Number of Earthquakes [log]')
plt.legend()
plt.grid()
plt.show()
The magnitude and depth columns will be normalized to a range between 0 and 1. The magnitude column will then be multiplied by 10; since k-NN classifies by distance, this scaling effectively gives magnitude ten times the weight of depth in the distance computation.
# Normalize the magnitude and depth columns (make it between 0 and 1)
data['magnitudo'] = (data['magnitudo'] - data['magnitudo'].min()) / (data['magnitudo'].max() - data['magnitudo'].min())
data['depth'] = (data['depth'] - data['depth'].min()) / (data['depth'].max() - data['depth'].min())
# As the magnitude is much stronger predictor than the depth, multiply the magnitude by 10
data['magnitudo'] = data['magnitudo'] * 10
# Print max and min values of the magnitude and depth columns
print("Magnitude - min:", data['magnitudo'].min(), "max:", data['magnitudo'].max())
print("Depth - min:", data['depth'].min(), "max:", data['depth'].max())
Magnitude - min: 0.0 max: 10.0
Depth - min: 0.0 max: 1.0
# Drop the original significance column
if 'original_significance' in data.columns:
data = data.drop(columns=['original_significance'])
# Print the data types of the columns
print("Data types of the columns after transformation:")
for col in data.columns:
print(" ", col, ":", data[col].dtype)
Data types of the columns after transformation:
  significance : category
  magnitudo : float16
  depth : float16
4. Model engineering¶
This section covers the creation of a machine learning model - a k-Nearest Neighbors classifier - to predict the significance of an earthquake based on its magnitude and depth. The following steps will be performed:
- Preparing training and testing datasets
- Training the model
- Testing the model
4.1. Preparing training and testing datasets¶
This section covers the preparation of the training and testing datasets for the machine learning model. The key problem is the imbalanced distribution of the significance values. To address this issue, under-sampling will be performed to create a balanced dataset.
The following numbers of occurrences of each significance value show the large imbalance in the dataset.
# Show the imbalance in the significance column
significance_counts = data['significance'].value_counts()
print("Occurrences of each significance value:")
print_counts(significance_counts)
Occurrences of each significance value:
 -> 0 | 2,900,678 (89.06%)
 -> 1 | 355,329 (10.91%)
 -> 2 | 1,041 (0.03%)
# Split the data into features and target
X = data.drop(columns=['significance'])
y = data['significance']
# Print the shapes of the features and target
pretty_shape_features = '; '.join([prettify_number(x) for x in X.shape])
pretty_shape_target = prettify_number(y.shape[0])
print(f"Shape of the features: ({pretty_shape_features})")
print(f"Shape of the target: ({pretty_shape_target})")
Shape of the features: (3,257,048; 2)
Shape of the target: (3,257,048)
To address the imbalance in the dataset, the under-sampling technique will be used. The dataset will be split into training and testing datasets in a custom way:
- The features and targets for significance 2 will be split 80/20 between the training and testing datasets.
- The training dataset will keep only four times as many significance-0 rows and twice as many significance-1 rows as there are significance-2 rows; the remaining rows will be assigned to the testing dataset.
- The per-class training and testing subsets will then be concatenated.
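The `test_size` expressions used below deserve a note: with `test_size=(len(y) - num) / len(y)`, `train_test_split` leaves exactly `num` rows on the training side. A check with invented data (128 samples, keeping 8; the sizes are chosen so the fraction is exact in floating point):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented class of 128 samples, of which only 8 should reach the training set
X_cls = np.arange(128).reshape(-1, 1)
y_cls = np.zeros(128)
num = 8

# test_size = (len - num) / len leaves exactly `num` rows on the training side
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cls, y_cls, test_size=(len(y_cls) - num) / len(y_cls), random_state=42)
print(len(X_tr), len(X_te))  # 8 120
```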
# Split the data into training and testing datasets in custom way
# - Get the significance values
y_0 = y[y == 0]
X_0 = X[y == 0]
y_1 = y[y == 1]
X_1 = X[y == 1]
y_2 = y[y == 2]
X_2 = X[y == 2]
# - Split the features and targets for significance 2 in 80/20 ratio
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.2, random_state=42)
# - Of the significance 0, keep only 4 times more than the significance 2. Of the significance 1, keep only 2 times more than the significance 2. Of the significance 2, keep all.
num_0 = len(y_2) * 4
num_1 = len(y_2) * 2
X_train_0, X_test_0, y_train_0, y_test_0 = train_test_split(X_0, y_0, test_size=(len(y_0) - num_0) / len(y_0), random_state=42)
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_1, test_size=(len(y_1) - num_1) / len(y_1), random_state=42)
# - Concatenate the training and testing datasets
X_train = pd.concat([X_train_0, X_train_1, X_train_2])
X_test = pd.concat([X_test_0, X_test_1, X_test_2])
y_train = pd.concat([y_train_0, y_train_1, y_train_2])
y_test = pd.concat([y_test_0, y_test_1, y_test_2])
# Print the shapes of the features and target
pretty_shape_features_train = '; '.join([prettify_number(x) for x in X_train.shape])
pretty_shape_target_train = prettify_number(y_train.shape[0])
pretty_shape_features_test = '; '.join([prettify_number(x) for x in X_test.shape])
pretty_shape_target_test = prettify_number(y_test.shape[0])
print(f"Shape of the training features: ({pretty_shape_features_train})")
print(f"Shape of the training target: ({pretty_shape_target_train})")
print(f"Shape of the testing features: ({pretty_shape_features_test})")
print(f"Shape of the testing target: ({pretty_shape_target_test})")
Shape of the training features: (7,078; 2)
Shape of the training target: (7,078)
Shape of the testing features: (3,249,970; 2)
Shape of the testing target: (3,249,970)
4.2. Train the model¶
In this section, a k-Nearest Neighbors classifier will be trained to predict the significance of an earthquake based on its magnitude and depth.
# Train the model
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train)
print('Model trained successfully! (n_neighbors=2)')
Model trained successfully! (n_neighbors=2)
4.3. Test the model¶
In this section, the model will be tested on the testing dataset. The following metrics will be calculated: accuracy, precision, recall, and F1 score. The metric values will be calculated both for the whole testing dataset and for each significance value separately.
# Make predictions for whole testing dataset
y_pred = model.predict(X_test)
print("Predictions made successfully!")
Predictions made successfully!
The metrics for the whole testing dataset are very promising, with high values for accuracy, precision, recall, and F1 score. Because the dataset is imbalanced, however, it is important to analyze the metrics for each significance value separately to get a more detailed picture of the model's performance.
# Print the metrics for whole testing dataset
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Metrics for the whole testing dataset:")
print(f" - Accuracy: {accuracy:.5f}")
print(f" - Precision: {precision:.5f}")
print(f" - Recall: {recall:.5f}")
print(f" - F1: {f1:.5f}")
Metrics for the whole testing dataset:
- Accuracy: 0.99688
- Precision: 0.99903
- Recall: 0.99688
- F1: 0.99792
def print_metrics_for_selected_significance_value(y_test: pd.Series, y_pred: np.ndarray, significance_value: int) -> None:
# The overall accuracy is reported for context; it is identical for every class
accuracy = accuracy_score(y_test, y_pred)
# Count true positives, false positives, and false negatives over the FULL
# prediction vector; filtering to the true members of the class first would
# hide all false positives and force the per-class precision to 1
true_mask = (y_test == significance_value).to_numpy()
pred_mask = (y_pred == significance_value)
tp = np.sum(true_mask & pred_mask)
fp = np.sum(~true_mask & pred_mask)
fn = np.sum(true_mask & ~pred_mask)
# Calculate precision, recall, and F1
precision = tp / (tp + fp) if tp + fp > 0 else 0
recall = tp / (tp + fn) if tp + fn > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
print(f"Metrics for significance {significance_value}:")
print(f" - Accuracy: {accuracy:.5f}")
print(f" - Precision: {precision:.5f}")
print(f" - Recall: {recall:.5f}")
print(f" - F1: {f1:.5f}")
# Print the metrics for significance 0
print_metrics_for_selected_significance_value(y_test, y_pred, 0)
Metrics for significance 0:
- Accuracy: 0.99688
- Precision: 1.00000
- Recall: 0.99948
- F1: 0.99974
# Print the metrics for significance 1
print_metrics_for_selected_significance_value(y_test, y_pred, 1)
Metrics for significance 1:
- Accuracy: 0.99688
- Precision: 1.00000
- Recall: 0.97575
- F1: 0.98772
# Print the metrics for significance 2
print_metrics_for_selected_significance_value(y_test, y_pred, 2)
Metrics for significance 2:
- Accuracy: 0.99688
- Precision: 1.00000
- Recall: 0.79904
- F1: 0.88830
5. Conclusion¶
First, it is important to address the hypotheses formulated in Section 1.2. Hypotheses formulation:
A. Higher magnitude earthquakes are more likely to be classified with higher significance.
Answer: Based on the strong positive correlation of 0.94 between magnitude and significance, as discussed in Section 2.3.2 Correlation between the significance and magnitude values, the data provides strong support for the hypothesis that higher magnitude earthquakes are more likely to be classified with higher significance. Thus, hypothesis A is supported by the data.
B. Earthquakes with lower depth are more likely to be classified with higher significance.
Answer: The correlation between depth and significance, observed in Section 2.3.3 Correlation between the significance and depth values, is weakly positive (0.36). Taken at face value, this means that deeper earthquakes tend to have slightly higher significance, which runs counter to the hypothesis, and the correlation is in any case too weak to support a strong claim in either direction. Thus, hypothesis B is not supported by the data.
Second, it is important to summarize the data exploration conducted in Section 2. Data analysis. Each data column was thoroughly analyzed, uncovering valuable insights and potential issues. Relevant columns were identified and examined in greater detail to ensure a comprehensive understanding of the dataset, which is crucial for addressing the hypotheses and training the machine learning model.
Third, we summarize the results of the machine learning model developed in Section 4. Model Engineering. The k-Nearest Neighbors classifier was trained to predict earthquake significance based on magnitude and depth. The model achieved the following performance metrics:
# Define the data
data = {
'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
'Whole Testing Dataset': [0.99688, 0.99903, 0.99688, 0.99792],
'Significance 0': [0.99688, 1.00000, 0.99948, 0.99974],
'Significance 1': [0.99688, 1.00000, 0.97575, 0.98772],
'Significance 2': [0.99688, 1.00000, 0.79904, 0.88830]
}
# Create DataFrame
df = pd.DataFrame(data)
# Display the table
df
|   | Metric | Whole Testing Dataset | Significance 0 | Significance 1 | Significance 2 |
|---|---|---|---|---|---|
| 0 | Accuracy | 0.99688 | 0.99688 | 0.99688 | 0.99688 |
| 1 | Precision | 0.99903 | 1.00000 | 1.00000 | 1.00000 |
| 2 | Recall | 0.99688 | 0.99948 | 0.97575 | 0.79904 |
| 3 | F1 Score | 0.99792 | 0.99974 | 0.98772 | 0.88830 |
NOTE: The reported per-class precision values of 1.00000 should be treated with caution. If false positives are counted only among test samples whose true label is the selected class, no false positive can ever be recorded and the per-class precision is trivially 1. The weighted precision of 0.99903, computed over the full test set, is the more reliable figure; the fact that it is below 1 implies that at least one class actually has precision below 1.
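Weighted precision is the support-weighted average of the per-class precisions, so a weighted value below 1 necessarily means at least one class has precision below 1 when computed over the full prediction vector. A toy demonstration with invented labels:

```python
import numpy as np
from sklearn.metrics import precision_score

# Invented 3-class labels where class 0 absorbs one false positive
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 0, 2])

per_class = precision_score(y_true, y_pred, average=None)  # [0.75, 1.0, 1.0]
support = np.bincount(y_true)                              # [3, 2, 1]
weighted = precision_score(y_true, y_pred, average='weighted')
manual = np.sum(per_class * support) / support.sum()
print(round(weighted, 4), round(manual, 4))  # 0.875 0.875
```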
Overall, the model performed well on the dominant classes. Recall decreased as the significance level increased, which is expected given how rare high-significance earthquakes are in the dataset; this drop is mirrored by the F1 score, which falls to 0.888 for the highest level. The per-class precision values of 1.0 should be interpreted together with the note above rather than as proof that the model is always correct when it predicts a given significance.